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Molecular simulations as well as single molecule experiments have been widely analyzed in terms 
of order parameters, the latter representing candidate probes for the relevant degrees of freedom. 
Notwithstanding this approach is very intuitive, mounting evidence showed that such description 
is not accurate, leading to ambiguous definitions of states and wrong kinetics. To overcome these 
limitations a framework making use of order parameter fluctuations in conjunction with complex 
network analysis is investigated. Derived from recent advances in the analysis of single molecule time 
traces, this approach takes into account of the fluctuations around each time point to distinguish 
between states that have similar values of the order parameter but different dynamics. Snapshots 
with similar fluctuations are used as nodes of a transition network, the clusterization of which 
into states provides accurate Markov-State-Models of the system under study. Application of the 
methodology to theoretical models with a noisy order parameter as well as the dynamics of a 
disordered peptide illustrates the possibility to build accurate descriptions of molecular processes 
on the sole basis of order parameter time series without using any supplementary information. 



INTRODUCTION 

Order parameters are conventionally used for the char- 
acterization of complex molecular processes [TJ [2] . Inter- 
atomic distances or a combination of them are common 
choices, providing an intuitive description in terms of 
free-energy projections [3H7]. Unfortunately, it has been 
repeatedly found that reduced descriptions based on or- 
der parameters are often inaccurate [8-13 . The origin 
of the failure is due to overlaps in the order parameter 
distribution, i.e., configurations with different properties 
corresponding to the same value of the coordinate, mak- 
ing the discrimination between states ambiguous [TQ| fT2] . 
From the point of view of the dynamics, spurious recross- 
ings at the borders result in lower free-energy barriers and 
artificially faster kinetics [T4] . 

To improve on this situation, a new arsenal of tools 
emerged making use of complex networks and the theory 
of stochastic processes. Configuration- space- networks, 
referred as Markov- St ate-Models when the Markov prop- 
erty is satisfied, provide high resolution free-energy land- 
scapes of complex molecular processes [8] |9j [T5HT9] . The 
main idea behind this approach is to map the molecular 
dynamics onto a transition network. Nodes and links rep- 
resent sampled system configurations (micro states) and 
the transitions between them as observed in the molecu- 
lar dynamics, respectively. The resulting transition net- 
work stores the entire kinetical information in the form 
of link weights and node connectivity, providing a com- 
pact representation of the molecular trajectory. Within 
this approach, free-energy representations are obtained 
in a more universal way without using arbitrarily projec- 
tions on order parameters. Both thermodynamics and 
kinetics come from the analysis of the transition net- 
work with methods like network clusterization algorithms 
[15l[18j|20], network flow analysis [HHEO and spectral 



methods (151(17]. 

Besides the advantages, a general strategy towards mi- 
crostates building is still missing, making the initial map- 
ping of the molecular trajectory onto a network non- 
trivial. Even for the well-studied case of structured pep- 
tides and the folding of small proteins, there is no con- 
sensus on the best practice [19, 22 J. Moreover, a broad 
set of problems including molecular association [23j [24] , 
the analysis of intrinsically disordered proteins [25] [26] 
and liquids [27] [28] are very hard to tackle with the cur- 
rent methodology. As shown for the case of water, ad-hoc 
strategies are needed [29-31 . Ironically, many of these 
processes can be qualitatively described by the analysis 
of conventional order parameters. 

In this work, an effort is made to reconcile the intuitive 
aspect of order parameters with the predictive power of 
transition networks, overcoming some of the limitations 
of both methodologies. The strategy couples a recently 
developed framework for the analysis of single molecule 
traces [32-34 using network clusterization techniques in 
order to obtain accurate kinetic models from conven- 
tional order parameter time series. Applications to the- 
oretical models and molecular dynamics simulations of 
a disordered peptide are presented. Our results suggest 
a general approach to analyze molecular processes with 
high accuracy on the sole basis of conventional order pa- 
rameter time series. 



THEORY 

Configuration-space networks from conventional 
order parameters 

General motivation. Order parameters allow for intu- 
itive descriptions of molecular processes. Unfortunately, 
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FIG. 1. Local- fluctuations order parameter analysis, (a) The 
time series of an order parameter and its distribution (black 
lines). Two snapshots with the same value of the order pa- 
rameter but belonging to different states are characterized 
by distinct local distributions (orange and light blue, respec- 
tively), (b) A Kolmogorov-Smirnov test evaluates the simi- 
larity of the cumulative of the two distributions. Snapshots 
with similar distributions belong to the same microstate. (c) 
Microstates and the transitions between them represent nodes 
and links of a configuration-space- network, respectively. Net- 
work clusterization techniques allow the lumping of kineti- 
cally homogeneous regions of the network into states (orange 
and light blue areas) (d) States are used to build a reduced 
Markov-State-Model of the original molecular process (see the 
Theory section for further details). 



such descriptions can be highly inaccurate due to the 
presence of overlaps, i.e., configurations with different 
properties corresponding to the same value of the or- 
der parameter [10] [12]. An important improvement in 
this respect was the introduction of configuration-space- 
networks, providing accurate and concise descriptions of 
the system kinetics and thermodynamics [3J [9] H5UT9] . 
Their application however is still limited, lacking a gen- 
eral way to build the transition network. 

To overcome this impasse, a potential strategy makes 
use of a recently introduced framework for the analysis of 
single molecule experiments [32-34 . Local fluctuations 
of a given coordinate bring relevant information on the 
underlying free-energy surface. That is, two points with 
similar values of the coordinate but belonging to differ- 
ent states are characterized by distinct local distributions 
(see orange and blue areas in Fig. [IJi). In order to char- 
acterize the molecular process, this information can be 
used in different ways, going from the concept of state 
"candidate" based on escape times [32, 34 to cut-based 
free-energy profiles [33]. The latter approach proposed a 
way to build the system microstates based on the local 



fluctuations of an arbitrary inter-atomic distance, show- 
ing that the folding barrier and native state population 
of a small protein are correctly recovered. On the other 
hand, the limited information contained in a single dis- 
tance made the full reconstruction of the unfolded state 
hard. 

Here, local fluctuations are exploited towards a better 
characterization of order parameter time series, overcom- 
ing the problems raised by the presence of overlaps. The 
proposed protocol works as follows: (i) based on the time 
series of the order parameter a set of microstates is con- 
structed; (ii) the resulting configuration-space-network is 
built; (iii) the presence of states is found by performing a 
network clusterization algorithm; (iv) a reduced kinetic 
model is built based on the found states. These four steps 
are described in detail below. 

Microstate building. The microstates accounting for 
the local fluctuations of the order parameter were built 
as suggested in Ref. [33]. As such, each time point of the 
trajectory U was associated with a corresponding time 
window [ti — r/2, t\ + r/2]. Two time points were consid- 
ered to be similar if they have comparable distributions 
of the order parameter. Snapshot similarity, D, was esti- 
mated by comparing the cumulative of the two distribu- 
tions via a Kolmogorov-Smirnov test [35 which checks 
whether two samples belong to the same distribution or 
not (Fig. [TJd). D was defined as the maximum differ- 
ence of the two cumulative distributions. Two samples 
belong to the same distribution, and thus to the same mi- 
crostate, if the condition D < was fulfilled. The 
acceptance cutoff ( corresponds to a certain confidence 
level. Being r and ( related, we fixed the latter value 
to 0.5 and let r vary. Comparisons were made along the 
trajectory using the leader algorithm in a way that each 
time point was associated to a microstate at the end of 
the procedure [33] [36] . 

The configuration- space-network. The resulting time 
series of microstates was mapped onto a configuration- 
space- network (Fig. Microstates represent network 
nodes and a link between them exists if they were suc- 
cessively visited along the molecular trajectory. For each 
link detailed balance was imposed by making an average 
of the number of transitions in both directions. 

Network clusterization. A clusterization algorithm 
was applied for the analysis of the configuration-space- 
network in order to detect the presence of states. In fact, 
free-energy basins are represented as densely connected 
regions of the configuration-space-network [15]. It was 
shown [15 , 18] that those regions can be automatically de- 
tected by using the Markov- Clustering- Algorithm (MCL) 
for network clusterization [37] . This approach is based on 
the evolution of random walkers on the network, resulting 
in a kinetically accurate network splitting. Hence, dy- 
namical interconversions within a cluster are faster than 
transitions to other regions of the network. Being sepa- 
rated by barriers, network clusters represent free-energy 
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basins (i.e., the states) of the system (orange and blue ar- 
eas in Fig. [1J). A parameter p > 1 tunes the granularity 
of the clusterization. Larger values of the parameter (e.g. 
p > 1.5) result in an increased number of clusters while 
the most relevant states (i.e. separated by the highest 
barriers) are already detected with p between 1.2 and 1.4 

mans]. 

Reduced kinetic model. The most populated clusters 
were used as states of a reduced Markov-State-Model. 
Transition probabilities were estimated from the original 
transition network by summing up all the links connect- 
ing any two states (orange and blue areas in Fig.[l]>d). It 
is worth noting that while this reduced kinetic model sat- 
isfies the Markov property [19], this is not generally the 
case for the starting transition network. (This property 
is in any case not needed when it comes to network clus- 
terization [EJ [18].) The accuracy of the kinetic model 
was estimated with the use of first-passage-time distri- 
butions. 



METHODS 
Two-state model 

A stochastic two-state model with transition proba- 
bility Pij = 0.01 was built. The latter probability com- 
pletely defines the kinetics of the model. The time evolu- 
tion was monitored by an artificially defined order param- 
eter Q in a way that Q cannot distinguish between the 
two states unambiguously (see dark blue line in Fig. ^ 
for a sample time series). Q was associated to an en- 
ergy function U\ = aQ 2 and U2 = ol(Q — l) 2 for the 
two states, respectively. As such, the first and sec- 
ond states preferentially visit different values of the or- 
der parameter. Within each state, the time evolution 
of Q was obtained by a conventional Metropolis crite- 
rion min [1, exp (— /3AU)] with j3 regulating the amount 
of fluctuations. Choosing a = 16, values of j3 close to 1 
suppress fluctuations while smaller values enhance them 
(most of the treatment below was done for the case of 
large fluctuations with f3 = 0.3). This model was used to 
generate order parameter time series of 10 5 steps. 



Molecular dynamics simulations 

Simulations of the (Gly-Ser)2 flexible linker peptide 
were performed using the all-atom CHARMM force- 
field version 27 [38j [39] as implemented in the program 
ACEMD [40]. All calculations were done on NVIDIA 
GTX680 graphics cards. The system was solvated into a 
box containing 1560 TIP3P waters. After equilibration, 
the molecular dynamics was run for 1/is in the NVT en- 
semble at 300K, using the Langevin algorithm. An inte- 
gration time step of 4 fs was used by rescaling the hydro- 
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FIG. 2. Theoretical two- state model, (a) The time series 
of the order parameter Q and the corresponding states ob- 
tained with the local-fluctuations analysis are shown in dark 
and light blue, respectively; (b) first-passage time distribu- 
tions obtained by different analysis techniques. The distribu- 
tion corresponding to the original two-state model is shown in 
dark blue. Distributions obtained from the local-fluctuations 
reduced kinetic model, a naive two-state model and along 
the original time series using Q = 0.5 as state separator are 
shown in light blue, yellow and red lines, respectively. The 
dependence of the mean-first-passage-time (mfpt) with the 
time window size r is shown in the inset (MCL p parame- 
ter of 1.3 and 1.4 for gray and dark gray points, respectively). 
The correct value of the mfpt and the time window value cho- 
sen for the analysis are shown as horizontal and vertical lines, 
respectively. 



gen mass to 4 amu together with mass repartitioning [41J. 
The radius of gyration was calculated with the program 
WORDOM [36 , 42 , neglecting all peptide hydrogens. 



RESULTS 

A two-state process with large fluctuations 

The protocol described in the Theory section was ap- 
plied to the time series of a generic order parameter 
Q with an underlying two-state dynamics. This model 
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served as a benchmark for the proposed network ap- 
proach since the underlying kinetics is known by con- 
struction. The amount of fluctuations of the order pa- 
rameter Q is controlled by a parameter j3 (see Methods). 
To characterize the two-state behavior directly from the 
time series a naive strategy would take the distribution 
of Q, looking for the minimum separating the two states. 
The value of Q at the minimum, 0.5 in this case, de- 
fines the separator between the states. If the fluctua- 
tions around the separator are small, Q is a good order 
parameter in the sense that the number of crossings of 
the separator represents a good estimate of the barrier 
between the two states. This approach is valid for the 
case of small fluctuations but breaks down as soon as the 
overlap between the states increases (i.e. large fluctua- 
tions). In the latter case, most of the contributions to 
the separator are coming from re-crossings: transitions 
passing the separator coming back very quickly to the 
initial state without reaching the end state. The origin 
of recrossings is mostly due to a sub-optimal choice of the 
order parameter rather than to a real physical property. 

The time series of Q for the case of large fluctuations 
(P = 0.3) was characterized with different approaches. 
A particular stringent test consists in the calculation of 
the first-passage-time (fpt) distribution to one of the two 
states. The correct distribution is shown as a dark blue 
line in Fig. [2]o (mean- fpt of 103.85 steps). This distribu- 
tion is greatly influenced by the definition of the target 
state. If the latter is correctly defined, the resulting dis- 
tribution overlap with the one calculated from the origi- 
nal two-state model. When the target state was chosen 
as Q < 0.5, a fpt distribution calculated along the trajec- 
tory resulted in a much faster kinetics (red line, Fig. [2^). 
With a mfpt of 7.56 steps, the kinetics obtained by this 
analysis was one order of magnitude faster with respect 
to the input model. 

A better description was obtained by analyzing the 
time series in terms of local-fluctuations and Markov- 
State-Models. The microstates were obtained by the 
protocol described in the Theory section (Fig. [TJ. Clus- 
terization of the resulting transition network resulted in 
the detection of two states. Being based on the order 
parameter fluctuations, these states are not strictly sep- 
arated in terms of Q as shown by their time series (light 
blue, Fig. [5|l). These two states were used to build a 
new Markov- State-Model with transition probabilities es- 
timated from the original transition network (see Meth- 
ods). In this case, the fpt distribution was estimated by 
generating a new time series of 10 6 steps from this model. 
Strikingly, the fpt distribution calculated on this new tra- 
jectory perfectly overlapped to the one generated by the 
original two-state model as shown by the light blue line 
of Fig. [2}d. Using a time window r = 30, the resulting 
mfpt is of about 114.58 steps, very close to the correct 
value of 103.85. This is not the case when a two-state 
model was built by estimating the transition probabil- 



ity from the number of times the separator Q = 0.5 was 
crossed. As expected, the fpt distribution calculated on 
the time series generated by this model leads to very poor 
results because the inter-state separator is dominated by 
recrossings (mfpt=5.2, yellow line, Fig. J2Jd). 

It is important to note that the two kinetic models pre- 
sented here (corresponding to the light blue and yellow 
lines) were built using the same starting information, i.e. 
the time series of the order parameter Q. Consequently, 
the improvement obtained by the local-fluctuations anal- 
ysis is purely due to the different strategy applied rather 
than the use of supplementary information. 

It is important to note that the predictions may de- 
pend on the choice of the time window r (see Theory). 
Caflisch and coworkers found a large range of validity 
for this parameter [33] . For the present two-state model, 
the time window range was evaluated against the mfpt 
to reach the target state (inset of Fig. |2Jd). For small 
time windows the value of the mfpt was smaller with re- 
spect to the correct one (horizontal line). This is due 
to the incorrect detection of states where large fluctua- 
tions were detected as real transitions. As the value of 
r was increased, the mfpt first converged to values close 
to the theoretical one (30 < r < 60) and then increased 
again. It was found that when the time window was too 
large, some transitions were missed, resulting in a overall 
slower kinetics. Consequently, the behavior of the mfpt 
as a function of r suggests a reasonable way to choose 
the time window as the location of the first slope change 
(r = 30, vertical line), just before the convergence region. 
Essentially identical results were found for the MCL pa- 
rameter p = 1.3 — 1.4 (inset of Fig. J2Jd). 

Multi-state dynamics of a disordered peptide 

GlySer peptides have been used in experiments for 
quite some time as flexible linkers [43] [44]. Short 
stretches of this peptide are interesting from a theo- 
retical point of view because they are computationally 
tractable, presenting non-trivial conformational disorder 
[20] [45] [46] . In this section the local- fluctuations analysis 
is applied to the dynamics of a (Gly-Ser)2 peptide. To 
this aim, a long molecular dynamics simulation of 1 /is 
was performed. It has been shown [46] that the radius 
of gyration R g qualitatively describes the conformational 
disorder of this peptide, suggesting the presence of mul- 
tiple states. A time series stretch and the distribution of 
the R g are shown as a dark blue line and a gray area in 
panel a and b of Fig. [3] respectively. 

Application of the local-fluctuations analysis on the 
R g time series revealed the presence of five major states 
(r = 300 ps and p — 1.3; these two values were chosen 
following the mfpt based strategy of the previous sec- 
tion). Fig.|3]3 shows the R g distribution of the five states 
(colored lines). Interestingly, the five distributions are 
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FIG. 3. The (Gly-Ser)2 peptide, (a) A time series stretch of the radius of gyration (R g , dark blue) and of the detected states 
after the local-fluctuations analysis (light blue); (b) Distribution of the radius of gyration. The distributions from the entire 
trajectory and for the five most populated states are shown as a gray area and colored lines, respectively, (c) Structural 
characterization of the three most compact states. For each of them, 50 random snapshots were overimposed. 



largely overlapping making their detection impossible by 
simply looking at the total distribution. In fact, the total 
distribution suggested no more than three states as indi- 
cated by the number of peaks (gray are in Fig. [3]o) . In- 
terestingly, the presence of the five states was already ap- 
parent in the raw R g time series (dark blue line, Fig. [3^l) . 
Those states became hidden in the total distribution due 
to the large fluctuations, reiterating the idea that using 
free-energy projections for the characterization of molec- 
ular processes can be ambiguous. 

From a structural point of view, all five states are 
well characterized. A molecular representation of the 
three most compact states is shown in Fig. [3J3. The 
most compact one, coded in light blue in Fig. [3)3, cor- 
responds to a loop-like structure typically found in /3- 
strands turns. This structure is stabilized by a hydrogen 
bond between the first backbone oxygen Oi and nitro- 
gen N4. This conformation has a population of around 
23% and R g ~ 3.1 A. The second state has a population 
of 22% and R g « 3.5 A ( green curve in Fig. 3p). Its 
topology is very similar to the turn-like structure but it 
is more disordered due to the formation of a non opti- 
mal backbone hydrogen bond. The third state instead is 
stabilized by the interaction of the side chain oxygen of 
SER2 with the backbone nitrogen N4 (population of 20% 



and R g ~ 3.7 A, yellow curve in Fig. [3)3). In this struc- 
ture, the side chain substitutes the backbone oxygen as 
a partner in the hydrogen bond, acting as a trap towards 
further compaction. Finally, the last two states (orange 
and red curves in Fig. ^jp) are rather unstructured, pro- 
viding similar realizations of almost completely extended 
conformations. 

To check whether the kinetics of the Markov- State- 
Model built on top of the five detected states reflected the 
same dynamics of the original trajectory, a fpt analysis 
was performed (Fig. [4|. In light blue, the fpt distribu- 
tion to the compact state was calculated on a trajectory 
originated from the Markov-State-Model. To compare it 
with the molecular dynamics simulation, the fpt to con- 
formations with R g = 3.16 ± 0.01 (the top of the first 
peak in the R g distribution) was calculated along the 
original molecular dynamics trajectory (dark blue line in 
Fig. [4]). This represents a good estimate of the fpt to the 
compact state at long times. Strikingly, the two curves 
nicely overlapped. An exponential fit of the data, i.e. 
~ exp(— t/t r ), showed a relaxation time of 6.5 ns in both 
cases. 

The fpt distribution calculated along the original tra- 
jectory using as target state all conformations with R g < 
3.4 (this is the value of the minimum of the R g dis- 
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time [ns] 

FIG. 4. The (Gly-Ser)2 first-passage-time distribution. Dis- 
tributions obtained from the local- fluctuations analysis, a 
naive two-state model and along the original time series 
are shown in light blue, yellow and red, respectively. For 
the latter two cases the target of the relaxation was R g < 
3.4. The relaxation kinetics to the compact state defined 
as the conformations belonging to the first peak of the R g 
(R g = 3.16 =b 0.01, see Fig. J3j)) is shown in dark blue. 

tribution) provided a much faster kinetics (red line in 
Fig. [4]). However, processes involving barrier crossings 
are expected to have the same relaxation kinetics when 
either the whole state is selected as a target or just the 
most probable conformation of the state (i.e. conforma- 
tions at the peak of the order parameter distribution) 
[19]. Consequently, the disagreement between the fpt 
distribution to R g < 3.4 and to a more stringent target 
(i.e. R g = 3.16 ± 0.01), provides strong evidence that 
the correct estimation of the kinetics cannot be obtained 
directly from the R g distribution due to the presence of 
recrossings. As expected, this becomes even worse when 
the Markov- State-Model is built directly from the R g dis- 
tribution, i.e. by choosing as states the portions of the 
distribution separated by minima and as transition prob- 
abilities the number of times the minima were crossed 
(yellow line in Fig. |4|. 

DISCUSSION 

Nowadays, molecular dynamics simulations easily pro- 
duce tera scale of data. As such, an important bottle- 
neck in the understanding of molecular processes is in 
the analysis rather then the data generation per se. 

In this work, a strategy for the analysis of conventional 
order parameters time series that is kinetics compliant 
was investigated. Our results provided strong evidence 
that the coupling of order parameter fluctuations with 



complex network analysis represents a powerful approach 
to deconvolute crowded order parameter distributions of 
molecular systems. This procedure allows the construc- 
tion of kinetically accurate Markov- State-Models in a 
natural and intuitive way, largely overcoming the prob- 
lems raised by conventional order parameter analysis. 
Moreover, a wide range of experimentally generated time 
traces coming from FRET or optical tweezers, can be 
readily tackled by this methodology. 

Taking into account the fluctuations within a time- 
window r, the approach is able to distinguish between 
snapshots belonging to different states but having the 
same value of the order parameter. Towards an accurate 
characterization of the kinetics the value of r needs to be 
chosen appropriately. We proposed to estimate it on the 
basis of a mean- first-passage-times analysis. Very similar 
in spirit to what Caflisch and collaborators proposed in 
their work [33], this procedure is more suitable to our 
network clusterization approach. 

Finally, it is worth mentioning that instead of look- 
ing at better ways to analyze conventional order param- 
eters time series, some groups focused their attention on 
the development of optimal reaction coordinates [47H49] . 
These abstract coordinates aim to correctly characterize 
the molecular kinetics. Among them, one method based 
on cut-based free-energy profiles seems very promising 
[49] . In this approach, the coefficients of a linear com- 
bination of physical distances are optimized against the 
cut-based free-energy profile. At the end of the process, 
the combination which maximizes the barrier height with 
respect to a target state provides the optimal reaction co- 
ordinate. A fundamental difference between this method 
and the local-fluctuations analysis is that the latter re- 
quires the time evolution of a single (non-optimal) coor- 
dinate while the optimization procedure makes use of a 
very large number of physical distances (usually around 
thousands [14, 49, 50] ) to perform properly. 
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